R for Chemistry Data Analysis and Chemometrics
Visual Analytics in R
Jordi Cuadros, Vanessa Serrano
January 2022

Package Management

Installation and Loading of R Packages

R has many contributed to solve specific needs and problems in several disciplines. These are usually hosted in repositories, such as CRAN (https://cran.r-project.org/), Bioconductor (https://www.bioconductor.org/), R-Forge (https://r-forge.r-project.org/), ROpenSci (https://ropensci.org/), Stan packages (http://mc-stan.org/r-packages/)… Yet other packages are hosted in version control websites as GitHub or GitLab.

In all cases, to make use of a package, it has to be installed and loaded in memory.

installed.packages()[,1] # Lists the installed packages
(.packages())            # Lists the packages currently loaded

Some relevant resources to find useful packages are https://www.rdocumentation.org/ and https://cran.rstudio.com/web/views/.

To install a package, we use

install.packages("dplyr")

To load a package in memory, we use

library("dplyr")

Alternatively, RStudio offers a menu-based package management feature.

In a script, to avoid repeated installations of a package and to ensure its availability, we can use

if(!require("dplyr")) {
  install.packages("dplyr",
                   repos="https://cloud.r-project.org/")
  library("dplyr")
}

If the package is installed but not loaded, we can still access its functions by using the ::operator.

dplyr::band_instruments

To browse the help files for a packages, run

help(package="dplyr")

Additionnaly some packages have a vignette and some demo code that can be inspected with the vignette() and demo() functions.

Last, to learn more about packages and package development, you may want to read https://r-pkgs.org/index.html.

Data Import and Export

Data Files

It is often useful to save and load data for its transfer among different application or just to keep a copy of it for back-up or later use.

Although R can read and write many different types of files, we will focus on the most relevant format for statistics, chemistry and chemometrics. We will discuss here how to handle

  • text files,
  • structured data files, such as XML and JSON,
  • files from other computational programs (i.e. Excel and Matlab),
  • RData files, the default format for R, and
  • some data files which are specific for chemistry and chemical analysis.

A Previous: Set the Working Directory

Before working with external files, it is important to set the working directory. setwd and getwd functions can be used to set the working directory from the R code, although its use is not recommended in scripts that may need to be shared.

RStudio allows setting the working directory from the Session menu and from the Files pane.

Text Files

Let’s start from the wine data set in the FactoMineR package.

if(!require("FactoMineR")) {
  install.packages("FactoMineR",
                   repos="https://cloud.r-project.org/")
  library("FactoMineR")
}
data("wine")
help("wine")
str(wine)
## 'data.frame':    21 obs. of  31 variables:
##  $ Label                        : Factor w/ 3 levels "Saumur","Bourgueuil",..: 1 1 2 3 1 2 2 1 3 1 ...
##  $ Soil                         : Factor w/ 4 levels "Reference","Env1",..: 2 2 2 3 1 1 1 2 2 3 ...
##  $ Odor.Intensity.before.shaking: num  3.07 2.96 2.86 2.81 3.61 ...
##  $ Aroma.quality.before.shaking : num  3 2.82 2.93 2.59 3.43 ...
##  $ Fruity.before.shaking        : num  2.71 2.38 2.56 2.42 3.15 ...
##  $ Flower.before.shaking        : num  2.28 2.28 1.96 1.91 2.15 ...
##  $ Spice.before.shaking         : num  1.96 1.68 2.08 2.16 2.04 ...
##  $ Visual.intensity             : num  4.32 3.22 3.54 2.89 4.39 ...
##  $ Nuance                       : num  4 3 3.39 2.79 4.04 ...
##  $ Surface.feeling              : num  3.27 2.81 3 2.54 3.38 ...
##  $ Odor.Intensity               : num  3.41 3.37 3.25 3.16 3.54 ...
##  $ Quality.of.odour             : num  3.31 3 2.93 2.88 3.36 ...
##  $ Fruity                       : num  2.88 2.56 2.77 2.39 3.16 ...
##  $ Flower                       : num  2.32 2.44 2.19 2.08 2.23 ...
##  $ Spice                        : num  1.84 1.74 2.25 2.17 2.15 ...
##  $ Plante                       : num  2 2 1.75 2.3 1.76 ...
##  $ Phenolic                     : num  1.65 1.38 1.25 1.48 1.6 ...
##  $ Aroma.intensity              : num  3.26 2.96 3.08 2.54 3.62 ...
##  $ Aroma.persistency            : num  2.96 2.81 2.8 2.58 3.3 ...
##  $ Aroma.quality                : num  3.2 2.93 3.08 2.48 3.46 ...
##  $ Attack.intensity             : num  2.96 3.04 3.22 2.7 3.46 ...
##  $ Acidity                      : num  2.11 2.11 2.18 3.18 2.57 ...
##  $ Astringency                  : num  2.43 2.18 2.25 2.18 2.54 ...
##  $ Alcohol                      : num  2.5 2.65 2.64 2.5 2.79 ...
##  $ Balance                      : num  3.25 2.93 3.32 2.33 3.46 ...
##  $ Smooth                       : num  2.73 2.5 2.68 1.68 3.04 ...
##  $ Bitterness                   : num  1.93 1.93 2 1.96 2.07 ...
##  $ Intensity                    : num  2.86 2.89 3.07 2.46 3.64 ...
##  $ Harmony                      : num  3.14 2.96 3.14 2.04 3.64 ...
##  $ Overall.quality              : num  3.39 3.21 3.54 2.46 3.74 ...
##  $ Typical                      : num  3.25 3.04 3.18 2.25 3.44 ...

To write a delimited text file from a data frame, we use the write.table function or any of its derivative functions. To read, read.table is to be used.

write.table(wine, file="wine.csv", sep=",", dec=".",
            quote=TRUE, fileEncoding="UTF-8", row.names=FALSE)
wine2 <- read.table("wine.csv", sep=",", dec=".",
           quote="\"", fileEncoding="UTF-8", header=TRUE)

In case we want a tab-delimited text file, we will use \t for the argument sep.

Other options exist for managing text-files.

In case the format is not well-defined, we may want to use readLines and writeLines to read or write the file line by line.

If the file is large, read_table from the readr package or vroom from the vroom package are usually faster.

Structured Text Files – XML

An example of an XML file containing the structure of caffeine can be found at https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/CID/2519/record/XML. It can be read in R with read_xml from the xml2 package.

if(!require("xml2")) {
  install.packages("xml2",
                   repos="https://cloud.r-project.org/")
  library("xml2")
}

Information can be extracted using XPath selectors, https://www.w3schools.com/xml/xpath_intro.asp.

xmlObj <- read_xml(
  "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/CID/2519/record/XML",
  options = "RECOVER")
xml_ns_strip(xmlObj)  # Important!
xmlObj_Name <- xml_find_all(xmlObj,xpath=
  ".//PC-InfoData//*[text()='Traditional']/../../..//PC-InfoData_value_sval/text()")
as.character(xmlObj_Name)
## [1] "caffeine"

NMR spectra can be obtained in CML (Chemistry Markup Language, http://www.xml-cml.org/spec/) format from https://www.nmrshiftdb.org.

For example, the 13C-NMR spectra for butane is available at https://www.nmrshiftdb.org/NmrshiftdbServlet/nmrshiftdbaction/searchorpredict/smiles/CCCC/spectrumtype/13C.

Structured Text Files – JSON

JSON files are another common format for data storage and transmission. In R, the jsonlite package offers functions to import and convert this files to list or data frames.

if(!require("jsonlite")) {
  install.packages("jsonlite",
                   repos="https://cloud.r-project.org/")
  library("jsonlite")
}

jsBz <- fromJSON(
  "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/241/JSON")
jsBz_NI <- jsBz$Record$Section[
  jsBz$Record$Section$TOCHeading=="Names and Identifiers",]
jsBz_NI_OI <- jsBz_NI$Section[[1]][
  jsBz_NI$Section[[1]]$TOCHeading=="Other Identifiers",]
jsBz_NI_OI$Section[[1]]$TOCHeading
##  [1] "CAS"                            "Related CAS"                   
##  [3] "Deprecated CAS"                 "European Community (EC) Number"
##  [5] "ICSC Number"                    "NSC Number"                    
##  [7] "RTECS Number"                   "UN Number"                     
##  [9] "UNII"                           "DSSTox Substance ID"           
## [11] "Wikipedia"                      "NCI Thesaurus Code"
CAS <- jsBz_NI_OI$Section[[1]][jsBz_NI_OI$Section[[1]]$TOCHeading=="CAS",]
table(unlist(CAS$Information[[1]]$Value$StringWithMarkup))
## 
## 26181-88-4    71-43-2 
##          1         13

Computational Programs – Excel

Let’s use the wine data set again.

str(wine)
## 'data.frame':    21 obs. of  31 variables:
##  $ Label                        : Factor w/ 3 levels "Saumur","Bourgueuil",..: 1 1 2 3 1 2 2 1 3 1 ...
##  $ Soil                         : Factor w/ 4 levels "Reference","Env1",..: 2 2 2 3 1 1 1 2 2 3 ...
##  $ Odor.Intensity.before.shaking: num  3.07 2.96 2.86 2.81 3.61 ...
##  $ Aroma.quality.before.shaking : num  3 2.82 2.93 2.59 3.43 ...
##  $ Fruity.before.shaking        : num  2.71 2.38 2.56 2.42 3.15 ...
##  $ Flower.before.shaking        : num  2.28 2.28 1.96 1.91 2.15 ...
##  $ Spice.before.shaking         : num  1.96 1.68 2.08 2.16 2.04 ...
##  $ Visual.intensity             : num  4.32 3.22 3.54 2.89 4.39 ...
##  $ Nuance                       : num  4 3 3.39 2.79 4.04 ...
##  $ Surface.feeling              : num  3.27 2.81 3 2.54 3.38 ...
##  $ Odor.Intensity               : num  3.41 3.37 3.25 3.16 3.54 ...
##  $ Quality.of.odour             : num  3.31 3 2.93 2.88 3.36 ...
##  $ Fruity                       : num  2.88 2.56 2.77 2.39 3.16 ...
##  $ Flower                       : num  2.32 2.44 2.19 2.08 2.23 ...
##  $ Spice                        : num  1.84 1.74 2.25 2.17 2.15 ...
##  $ Plante                       : num  2 2 1.75 2.3 1.76 ...
##  $ Phenolic                     : num  1.65 1.38 1.25 1.48 1.6 ...
##  $ Aroma.intensity              : num  3.26 2.96 3.08 2.54 3.62 ...
##  $ Aroma.persistency            : num  2.96 2.81 2.8 2.58 3.3 ...
##  $ Aroma.quality                : num  3.2 2.93 3.08 2.48 3.46 ...
##  $ Attack.intensity             : num  2.96 3.04 3.22 2.7 3.46 ...
##  $ Acidity                      : num  2.11 2.11 2.18 3.18 2.57 ...
##  $ Astringency                  : num  2.43 2.18 2.25 2.18 2.54 ...
##  $ Alcohol                      : num  2.5 2.65 2.64 2.5 2.79 ...
##  $ Balance                      : num  3.25 2.93 3.32 2.33 3.46 ...
##  $ Smooth                       : num  2.73 2.5 2.68 1.68 3.04 ...
##  $ Bitterness                   : num  1.93 1.93 2 1.96 2.07 ...
##  $ Intensity                    : num  2.86 2.89 3.07 2.46 3.64 ...
##  $ Harmony                      : num  3.14 2.96 3.14 2.04 3.64 ...
##  $ Overall.quality              : num  3.39 3.21 3.54 2.46 3.74 ...
##  $ Typical                      : num  3.25 3.04 3.18 2.25 3.44 ...

To write a data.frame to Excel, we can use the write_xlsx from the writexl package. To read an Excel file, read_xlsx from the readxl package can be used.

if(!require("writexl")) {
  install.packages("writexl",
                   repos="https://cloud.r-project.org/")
  library("writexl")
}

if(!require("readxl")) {
  install.packages("readxl",
                   repos="https://cloud.r-project.org/")
  library("readxl")
}
write_xlsx(list(wineSheet = wine),
                  path = "wine.xlsx")
wineExcel <- readxl::read_xlsx(path = "wine.xlsx")

Computational Programs – Matlab

To open and write Matlab data files, we can use the R.matlab package. Relevant functions are called readMat and writeMat.

if(!require("R.matlab")) {
  install.packages("R.matlab",
                   repos="https://cloud.r-project.org/")
  library("R.matlab")
}
writeMat("wine.mat", wine = wine)
wineMat <-readMat("wine.mat")
wineMat <- data.frame(as.character(unlist(wineMat[[1]][[1]])),
                      as.character(unlist(wineMat[[1]][2])),
                      as.data.frame(wineMat[[1]][3:31]))

Reading a Matlab data file returns a list that will need ad hoc manipulation.

RData Files

To save and read data in the R data format, we use save and load. These allow storing and restoring any set of variables of the environment. When loaded, variables are recovered with the same names they had on saving.

save(wine, file="wine.rda")
print(load("wine.rda"))          # print shows the name of the loaded objects
wineR <- get(load("wine.rda"))   # get allows storing the loaded information 
                                 # with a different name

Chemical Data Formats

Besides the general data file formats already discussed, there are several formats used specifically to store chemical information. Some significant ones are

More information on chemical data files can be found at https://en.wikipedia.org/wiki/Chemical_file_format, https://en.wikipedia.org/wiki/Mass_spectrometry_data_format, http://unichrom.com/chrom/uc-ffe.shtml or https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3518119/.

Chemical Data Formats – Chemical Table File

Chemical Table Files are a set of text-based file types designed to store molecular information, e.g. positions of the atoms and connection tables. Some of this format allow storing additional information such as conformers, additional properties and identifiers, or spectra.

Currently the simplest way to read a MOL or SDF file in R is the read.SDFset function in the ChemmineR Bioconductor package.

if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager",
                   repos="https://cloud.r-project.org/")

if (!require("ChemmineR", quietly = TRUE)) {
  BiocManager::install("ChemmineR", update=FALSE)
  library("ChemmineR")
}
sdf1 <- read.SDFset(
  "https://cactus.nci.nih.gov/chemical/structure/CCCC/file?format=sdf")
draw_sdf(sdf1[[1]], filename=NULL)

ChemmineR::header(sdf1)
## $CMP1
##                                    Molecule_Name 
##                                          "C4H10" 
##                                           Source 
## "APtclcactv01172209173D 0   0.00000     0.00000" 
##                                          Comment 
##                                              " " 
##                                      Counts_Line 
##        " 14 13  0  0  0  0  0  0  0  0999 V2000"
MW(sdf1[[1]])
##     CMP 
## 58.1222

Chemical Data Formats – JCAMP-DX

JCAMP-DX is an text-based open standard to store and distribute spectral data. Currently the best way to read these files in R is the readJDX function in the readJDX package.

if (!require("readJDX")) {
  install.packages("readJDX",
                   repos="https://cloud.r-project.org/")
  library("readJDX")
}

Acetone IR

jdx1 <- readJDX ("../data/67-64-1-IR.jdx")
plot(jdx1$Acetone$x,jdx1$Acetone$y, type="l",
     xlab="wave number", ylab="T")

JCAMP-DX downloaded from https://webbook.nist.gov/cgi/cbook.cgi?ID=C67641, in NIST Chemistry WebBook, NIST Standard Reference Database Number 69, Eds. P.J. Linstrom and W.G. Mallard, National Institute of Standards and Technology, Gaithersburg MD, 20899, https://doi.org/10.18434/T4D303, (on January 5, 2022).

Chemical Data Formats – NIST MSP

MSP is a text-based NIST-promoted format to store collections of spectra. It’s a common format in the metabolomics community.

Downloadable spectral databases can be found at

While a functional package to read MSP files doesn’t seem to be available, these files can be read with standard functions for text files.

    msp1 <- readLines("../data/MSMS-Neg-MassBankEU.msp")
    msp1 <- paste(msp1, collapse="\n")
    msp1 <- unlist(strsplit(msp1, "\n\n"))

    msp1_1 <- unlist(strsplit(msp1[1],"\n"))
    spectrum <- msp1_1[1:length(msp1_1) > which(substr(msp1_1,1,10)=="Num Peaks:")]
    spectrum <- read.table(text=spectrum, sep="\t")
    colnames(spectrum) <- c("m_z","int")
    spectrum
##        m_z  int
## 1  96.9602   70
## 2 241.0031 1000
    spectrum <- data.frame(m_z=rep(spectrum$m_z,each=3),
                          int=rep(spectrum$int,each=3),
                          i=1:3)
    spectrum$int[spectrum$i!=2] <- 0

    plot(x=spectrum$m_z, y=spectrum$int, xlim=c(40,300),
         type="l", xlab="m/z", ylab="")

::: {.bibref} MSMS-Neg-MassBankEU.msp downloaded from http://prime.psc.riken.jp/compms/msdial/main.html#MSP

YOUR TURN

  1. Import the classical wine data of Forina et al. It is available at http://archive.ics.uci.edu/ml/datasets/wine.
  2. Load the data from the NIR spectra for diesel fuels which is described and available at https://eigenvector.com/resources/data-sets/.
  3. Import the data from mango aroma components available at https://figshare.com/articles/dataset/Mango_Mangifera_indica_Aroma_Discriminate_Cultivars_and_Ripeness_Stages/14303913. Warning: it needs some cleaning!
  4. Extract the MS of lactic acid from the Golm database, at http://gmd.mpimp-golm.mpg.de/download/ and plot it. It is available in the MDN35 ALK data set.

Advanced manipulation of data frames (part 2 - the tidy way)

Data frame

As already presented, we usually have to arrange data we have in a data frame by

  • renaming columns or rows,
  • adding columns or rows,
  • segmenting (selecting columns or rows),
  • removing columns or rows,
  • creating a data frame from combining vectors,
  • joining data frames,
  • reshaping data tables, and
  • summarizing or aggregating the data.

Many of these operations can be performed by using a set of functions that operate in a single common paradigm, sometimes defined as grammar for data manipulation, implemented in the dplyr and tidyr R packages.

We usually install and load both packages jointly with some other packages that work nicely together. These are grouped in the tidyverse package.

if (!require("tidyverse")) {
  install.packages("tidyverse",
                   repos="https://cloud.r-project.org/")
  library("tidyverse")
}

Functions

R supports a functional approach to programming where functions can be defined and transferred as arguments. Many calculations in R are integrated as functions. Functions are a critical concept in a tidy data management.

Functions are defined with the function reserved word (or \ since R 4.1.0).

sq <- function(x) {x ^ 2}
pow <- \(x,y) {x ^ y}

sq(4)
## [1] 16
pow(3,4)
## [1] 81

Pipe Operators, %>% and |>

Pipes are a syntax variant for transferring arguments to functions. It allows for a more process oriented organization of the code. R includes two types of pipes: %>% from the magrittr package and |> included in R 4.1.0.

sq(4)
## [1] 16
4 %>% sq    # With %>%, parenthesis are not required when there is no arguments
## [1] 16
4 |> sq()
## [1] 16
pow(sq(sq(2)),3)
## [1] 4096
2 %>% sq %>% sq %>% pow(3)
## [1] 4096

By default, the left operand of the pipe is sent as the first argument of the function on the right. It can be sent to a different position in the calculation by including a dot (.) in the right-hand side of the pipe.

pow(sq(sq(2)),sq(sq(2))/100)
## [1] 1.558329
2 %>% sq %>% sq(.) %>% pow(.,./100)
## [1] 1.558329

dplyr and tidyr

dplyr and tidyr are two packages of the tidyverse designed for manipulating and tidying data. A full description of both packages can be found at https://dplyr.tidyverse.org/ and https://tidyr.tidyverse.org/.

Current cheatsheets for both packages can be found at https://www.rstudio.com/resources/cheatsheets/.

Data manipulation in these packages is commonly done using a limited set of functions. The most important are

  • select and filter, for subsetting,
  • mutate, for creating or updating calculated columns,
  • arrange, for sorting,
  • pivot_wider and pivot_longer, for reshaping, and
  • group by and summarize, for aggregating.

Let’s see how we operate in this logic, using as an example the wine data set.

str(wine) 
## 'data.frame':    21 obs. of  31 variables:
##  $ Label                        : Factor w/ 3 levels "Saumur","Bourgueuil",..: 1 1 2 3 1 2 2 1 3 1 ...
##  $ Soil                         : Factor w/ 4 levels "Reference","Env1",..: 2 2 2 3 1 1 1 2 2 3 ...
##  $ Odor.Intensity.before.shaking: num  3.07 2.96 2.86 2.81 3.61 ...
##  $ Aroma.quality.before.shaking : num  3 2.82 2.93 2.59 3.43 ...
##  $ Fruity.before.shaking        : num  2.71 2.38 2.56 2.42 3.15 ...
##  $ Flower.before.shaking        : num  2.28 2.28 1.96 1.91 2.15 ...
##  $ Spice.before.shaking         : num  1.96 1.68 2.08 2.16 2.04 ...
##  $ Visual.intensity             : num  4.32 3.22 3.54 2.89 4.39 ...
##  $ Nuance                       : num  4 3 3.39 2.79 4.04 ...
##  $ Surface.feeling              : num  3.27 2.81 3 2.54 3.38 ...
##  $ Odor.Intensity               : num  3.41 3.37 3.25 3.16 3.54 ...
##  $ Quality.of.odour             : num  3.31 3 2.93 2.88 3.36 ...
##  $ Fruity                       : num  2.88 2.56 2.77 2.39 3.16 ...
##  $ Flower                       : num  2.32 2.44 2.19 2.08 2.23 ...
##  $ Spice                        : num  1.84 1.74 2.25 2.17 2.15 ...
##  $ Plante                       : num  2 2 1.75 2.3 1.76 ...
##  $ Phenolic                     : num  1.65 1.38 1.25 1.48 1.6 ...
##  $ Aroma.intensity              : num  3.26 2.96 3.08 2.54 3.62 ...
##  $ Aroma.persistency            : num  2.96 2.81 2.8 2.58 3.3 ...
##  $ Aroma.quality                : num  3.2 2.93 3.08 2.48 3.46 ...
##  $ Attack.intensity             : num  2.96 3.04 3.22 2.7 3.46 ...
##  $ Acidity                      : num  2.11 2.11 2.18 3.18 2.57 ...
##  $ Astringency                  : num  2.43 2.18 2.25 2.18 2.54 ...
##  $ Alcohol                      : num  2.5 2.65 2.64 2.5 2.79 ...
##  $ Balance                      : num  3.25 2.93 3.32 2.33 3.46 ...
##  $ Smooth                       : num  2.73 2.5 2.68 1.68 3.04 ...
##  $ Bitterness                   : num  1.93 1.93 2 1.96 2.07 ...
##  $ Intensity                    : num  2.86 2.89 3.07 2.46 3.64 ...
##  $ Harmony                      : num  3.14 2.96 3.14 2.04 3.64 ...
##  $ Overall.quality              : num  3.39 3.21 3.54 2.46 3.74 ...
##  $ Typical                      : num  3.25 3.04 3.18 2.25 3.44 ...

Renaming columns or rows

newColnames <- c("label", "soil", "odorBf", "aromaBf",
                 "fruityBf", "flowerBf", "spiceBf",
                 "visual", "nuance", "surface", "odorI",
                 "odorQ", "fruity", "flower", "spice",
                 "plante", "phenolic", "aromaI", "aromaP",
                 "aromaQ","attackI", "acidity", "astringency",
                 "alcohol", "balance", "smooth", "bitterness",
                 "intensity", "harmony", "overall", "typical")
wine2 <- wine %>% rename_with(function(x) newColnames)
colnames(wine2)
##  [1] "label"       "soil"        "odorBf"      "aromaBf"     "fruityBf"   
##  [6] "flowerBf"    "spiceBf"     "visual"      "nuance"      "surface"    
## [11] "odorI"       "odorQ"       "fruity"      "flower"      "spice"      
## [16] "plante"      "phenolic"    "aromaI"      "aromaP"      "aromaQ"     
## [21] "attackI"     "acidity"     "astringency" "alcohol"     "balance"    
## [26] "smooth"      "bitterness"  "intensity"   "harmony"     "overall"    
## [31] "typical"
wine2 <- wine2 %>% rownames_to_column() %>%
  mutate(rowname = paste0("W",sprintf("%0.2d",1:21)))
wine2$rowname
##  [1] "W01" "W02" "W03" "W04" "W05" "W06" "W07" "W08" "W09" "W10" "W11" "W12"
## [13] "W13" "W14" "W15" "W16" "W17" "W18" "W19" "W20" "W21"

Adding columns or rows

wine2 <- wine2 %>% mutate(newCol=1:21)
head(wine2[,c(1,30:33)])
##   rowname harmony overall typical newCol
## 1     W01   3.143   3.393   3.250      1
## 2     W02   2.964   3.214   3.036      2
## 3     W03   3.143   3.536   3.179      3
## 4     W04   2.038   2.464   2.250      4
## 5     W05   3.643   3.741   3.444      5
## 6     W06   3.500   3.643   3.393      6
wine2 <- wine2 %>% add_row(wine2[21,])
tail(wine2[,c(1,30:33)])
##    rowname harmony overall typical newCol
## 17     W17   3.250   3.536   3.269     17
## 18     W18   3.074   3.464   3.444     18
## 19     W19   2.107   2.370   2.321     19
## 20     W20   2.741   2.643   2.571     20
## 21     W21   3.000   2.852   2.750     21
## 22     W21   3.000   2.852   2.750     21

Segmenting

wine2 %>% select(rowname,label,soil,overall) %>% 
  filter(overall>3.6)
##   rowname      label      soil overall
## 1     W05     Saumur Reference   3.741
## 2     W06 Bourgueuil Reference   3.643
## 3     W07 Bourgueuil Reference   3.714
## 4     W14     Saumur Reference   3.929
## 5     W15 Bourgueuil      Env1   3.643
## 6     W16 Bourgueuil Reference   3.750
wine2 %>% rowwise() %>% mutate(worse=min(c_across(4:30))) %>% 
  filter(worse<1.5) %>% select(rowname, worse)
## # A tibble: 8 x 2
## # Rowwise: 
##   rowname worse
##   <chr>   <dbl>
## 1 W02      1.38
## 2 W03      1.25
## 3 W04      1.48
## 4 W06      1.48
## 5 W07      1.48
## 6 W16      1.4 
## 7 W18      1.43
## 8 W19      1.32
wine2 %>% select(overall) %>% str
## 'data.frame':    22 obs. of  1 variable:
##  $ overall: num  3.39 3.21 3.54 2.46 3.74 ...
wine2 %>% pull(overall) %>% str
##  num [1:22] 3.39 3.21 3.54 2.46 3.74 ...

Removing columns or rows

wine2 <- wine2 %>% select(-newCol)
tail(wine2[,30:ncol(wine2)])
##    harmony overall typical
## 17   3.250   3.536   3.269
## 18   3.074   3.464   3.444
## 19   2.107   2.370   2.321
## 20   2.741   2.643   2.571
## 21   3.000   2.852   2.750
## 22   3.000   2.852   2.750
wine3 <- wine2 %>% filter(overall<3.5)
wine3 %>% pull(rowname)
##  [1] "W01" "W02" "W04" "W08" "W09" "W10" "W12" "W18" "W19" "W20" "W21" "W21"

Joining data frames

qualityLevel <- data.frame(level=1:4,
  qualityNom = c("Horrendous","Bad", "Good", "Excellent"))
wine2 <- wine2 %>% mutate(level=round(overall,0)) %>% 
  inner_join(qualityLevel)
## Joining, by = "level"
head(wine2[,30:ncol(wine2)])
##   harmony overall typical level qualityNom
## 1   3.143   3.393   3.250     3       Good
## 2   2.964   3.214   3.036     3       Good
## 3   3.143   3.536   3.179     4  Excellent
## 4   2.038   2.464   2.250     2        Bad
## 5   3.643   3.741   3.444     4  Excellent
## 6   3.500   3.643   3.393     4  Excellent

Reshaping data tables

The same information can be presented in many different ways, even in a data table. These are commonly referred as formats o shapes.

Having tidy data is critical to facilitate data preparation, analysis and visualization.

Data is tidy when structured in data table where every row refers to an individual, an observational unit, or an observation; and each column refers to a variable or property of the observational unit. Data is untidy if not rectangular, if different observational types are in the same data table, if different rows refer to the same observation or observational unit, if a column contains more than one type of measurement, if values of variable are in the column headers…

Wickham, H. (2014). Tidy data. Journal of statistical software, 59(1), 1-23.

When we have many variables that correspond to the same type of measurement, a data frame can be organized in two formats: a long format and a wide format.

In wide format, rows are observational units and there are different columns for the different variables being measured.

In long format, rows are observations and there is a column for the values and a column for the variable being measured.

Different analyses and visualizations will require having the data organized in one or other format.

Please be aware that tidy data can be in wide or long format, and that data in long or wide formats can still be untidy.

The data set wine is available in a wide format. Each row is a type of wine.

Let’s convert it to long…

wineL <- wine %>% rownames_to_column() %>% 
  pivot_longer(4:32,
               names_to = "determination",
               values_to = "value")
str(wineL)
## tibble [609 x 5] (S3: tbl_df/tbl/data.frame)
##  $ rowname      : chr [1:609] "2EL " "2EL " "2EL " "2EL " ...
##  $ Label        : Factor w/ 3 levels "Saumur","Bourgueuil",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Soil         : Factor w/ 4 levels "Reference","Env1",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ determination: chr [1:609] "Odor.Intensity.before.shaking" "Aroma.quality.before.shaking" "Fruity.before.shaking" "Flower.before.shaking" ...
##  $ value        : num [1:609] 3.07 3 2.71 2.28 1.96 ...
head(wineL)
## # A tibble: 6 x 5
##   rowname Label  Soil  determination                 value
##   <chr>   <fct>  <fct> <chr>                         <dbl>
## 1 "2EL "  Saumur Env1  Odor.Intensity.before.shaking  3.07
## 2 "2EL "  Saumur Env1  Aroma.quality.before.shaking   3   
## 3 "2EL "  Saumur Env1  Fruity.before.shaking          2.71
## 4 "2EL "  Saumur Env1  Flower.before.shaking          2.28
## 5 "2EL "  Saumur Env1  Spice.before.shaking           1.96
## 6 "2EL "  Saumur Env1  Visual.intensity               4.32

Let’s go back to wide…

wineW <- pivot_wider(wineL, id_cols = 1:3,
                      names_from = "determination",
                      values_from = "value")
str(wineW)
## tibble [21 x 32] (S3: tbl_df/tbl/data.frame)
##  $ rowname                      : chr [1:21] "2EL " "1CHA" "1FON" "1VAU" ...
##  $ Label                        : Factor w/ 3 levels "Saumur","Bourgueuil",..: 1 1 2 3 1 2 2 1 3 1 ...
##  $ Soil                         : Factor w/ 4 levels "Reference","Env1",..: 2 2 2 3 1 1 1 2 2 3 ...
##  $ Odor.Intensity.before.shaking: num [1:21] 3.07 2.96 2.86 2.81 3.61 ...
##  $ Aroma.quality.before.shaking : num [1:21] 3 2.82 2.93 2.59 3.43 ...
##  $ Fruity.before.shaking        : num [1:21] 2.71 2.38 2.56 2.42 3.15 ...
##  $ Flower.before.shaking        : num [1:21] 2.28 2.28 1.96 1.91 2.15 ...
##  $ Spice.before.shaking         : num [1:21] 1.96 1.68 2.08 2.16 2.04 ...
##  $ Visual.intensity             : num [1:21] 4.32 3.22 3.54 2.89 4.39 ...
##  $ Nuance                       : num [1:21] 4 3 3.39 2.79 4.04 ...
##  $ Surface.feeling              : num [1:21] 3.27 2.81 3 2.54 3.38 ...
##  $ Odor.Intensity               : num [1:21] 3.41 3.37 3.25 3.16 3.54 ...
##  $ Quality.of.odour             : num [1:21] 3.31 3 2.93 2.88 3.36 ...
##  $ Fruity                       : num [1:21] 2.88 2.56 2.77 2.39 3.16 ...
##  $ Flower                       : num [1:21] 2.32 2.44 2.19 2.08 2.23 ...
##  $ Spice                        : num [1:21] 1.84 1.74 2.25 2.17 2.15 ...
##  $ Plante                       : num [1:21] 2 2 1.75 2.3 1.76 ...
##  $ Phenolic                     : num [1:21] 1.65 1.38 1.25 1.48 1.6 ...
##  $ Aroma.intensity              : num [1:21] 3.26 2.96 3.08 2.54 3.62 ...
##  $ Aroma.persistency            : num [1:21] 2.96 2.81 2.8 2.58 3.3 ...
##  $ Aroma.quality                : num [1:21] 3.2 2.93 3.08 2.48 3.46 ...
##  $ Attack.intensity             : num [1:21] 2.96 3.04 3.22 2.7 3.46 ...
##  $ Acidity                      : num [1:21] 2.11 2.11 2.18 3.18 2.57 ...
##  $ Astringency                  : num [1:21] 2.43 2.18 2.25 2.18 2.54 ...
##  $ Alcohol                      : num [1:21] 2.5 2.65 2.64 2.5 2.79 ...
##  $ Balance                      : num [1:21] 3.25 2.93 3.32 2.33 3.46 ...
##  $ Smooth                       : num [1:21] 2.73 2.5 2.68 1.68 3.04 ...
##  $ Bitterness                   : num [1:21] 1.93 1.93 2 1.96 2.07 ...
##  $ Intensity                    : num [1:21] 2.86 2.89 3.07 2.46 3.64 ...
##  $ Harmony                      : num [1:21] 3.14 2.96 3.14 2.04 3.64 ...
##  $ Overall.quality              : num [1:21] 3.39 3.21 3.54 2.46 3.74 ...
##  $ Typical                      : num [1:21] 3.25 3.04 3.18 2.25 3.44 ...
head(wineW,2)
## # A tibble: 2 x 32
##   rowname Label  Soil  Odor.Intensity.bef~ Aroma.quality.bef~ Fruity.before.sh~
##   <chr>   <fct>  <fct>               <dbl>              <dbl>             <dbl>
## 1 "2EL "  Saumur Env1                 3.07               3                 2.71
## 2 "1CHA"  Saumur Env1                 2.96               2.82              2.38
## # ... with 26 more variables: Flower.before.shaking <dbl>,
## #   Spice.before.shaking <dbl>, Visual.intensity <dbl>, Nuance <dbl>,
## #   Surface.feeling <dbl>, Odor.Intensity <dbl>, Quality.of.odour <dbl>,
## #   Fruity <dbl>, Flower <dbl>, Spice <dbl>, Plante <dbl>, Phenolic <dbl>,
## #   Aroma.intensity <dbl>, Aroma.persistency <dbl>, Aroma.quality <dbl>,
## #   Attack.intensity <dbl>, Acidity <dbl>, Astringency <dbl>, Alcohol <dbl>,
## #   Balance <dbl>, Smooth <dbl>, Bitterness <dbl>, Intensity <dbl>, ...

Summarizing or Aggregating the Data.

For wide data frames

wineW %>% select(4:32) %>% summarize(across(
  everything(),list(mean=mean,sd=sd))) %>% t
##                                         [,1]
## Odor.Intensity.before.shaking_mean 3.1111429
## Odor.Intensity.before.shaking_sd   0.2894526
## Aroma.quality.before.shaking_mean  3.0457619
## Aroma.quality.before.shaking_sd    0.2056397
## Fruity.before.shaking_mean         2.7140476
## Fruity.before.shaking_sd           0.2010245
## Flower.before.shaking_mean         2.0570000
## Flower.before.shaking_sd           0.1435949
## Spice.before.shaking_mean          1.9914762
## Spice.before.shaking_sd            0.2396620
## Visual.intensity_mean              3.9575238
## Visual.intensity_sd                0.5494639
## Nuance_mean                        3.7273810
## Nuance_sd                          0.5207391
## Surface.feeling_mean               3.1802857
## Surface.feeling_sd                 0.2798259
## Odor.Intensity_mean                3.3952381
## Odor.Intensity_sd                  0.2004894
## Quality.of.odour_mean              3.2373333
## Quality.of.odour_sd                0.2231021
## Fruity_mean                        2.8471429
## Fruity_sd                          0.2215963
## Flower_mean                        2.1621429
## Flower_sd                          0.1507466
## Spice_mean                         2.1783333
## Spice_sd                           0.2063411
## Plante_mean                        1.9644286
## Plante_sd                          0.1634933
## Phenolic_mean                      1.5423810
## Phenolic_sd                        0.1525924
## Aroma.intensity_mean               3.1969524
## Aroma.intensity_sd                 0.2548253
## Aroma.persistency_mean             2.9827619
## Aroma.persistency_sd               0.2454870
## Aroma.quality_mean                 3.0636190
## Aroma.quality_sd                   0.3159952
## Attack.intensity_mean              3.1562857
## Attack.intensity_sd                0.3060454
## Acidity_mean                       2.3852857
## Acidity_sd                         0.2398940
## Astringency_mean                   2.4402857
## Astringency_sd                     0.1967954
## Alcohol_mean                       2.7406667
## Alcohol_sd                         0.1673895
## Balance_mean                       3.1290476
## Balance_sd                         0.3326547
## Smooth_mean                        2.6742381
## Smooth_sd                          0.4139867
## Bitterness_mean                    2.0679048
## Bitterness_sd                      0.1864001
## Intensity_mean                     3.1659524
## Intensity_sd                       0.3725651
## Harmony_mean                       3.1479524
## Harmony_sd                         0.4420390
## Overall.quality_mean               3.3311429
## Overall.quality_sd                 0.4320131
## Typical_mean                       3.1967619
## Typical_sd                         0.4257324

For long data frames

wineL %>% group_by(determination) %>% 
  summarize(mean=mean(value), sd=sd(value))
## # A tibble: 29 x 3
##    determination                 mean    sd
##    <chr>                        <dbl> <dbl>
##  1 Acidity                       2.39 0.240
##  2 Alcohol                       2.74 0.167
##  3 Aroma.intensity               3.20 0.255
##  4 Aroma.persistency             2.98 0.245
##  5 Aroma.quality                 3.06 0.316
##  6 Aroma.quality.before.shaking  3.05 0.206
##  7 Astringency                   2.44 0.197
##  8 Attack.intensity              3.16 0.306
##  9 Balance                       3.13 0.333
## 10 Bitterness                    2.07 0.186
## # ... with 19 more rows

Advanced Graphics in R

Grammar of Graphics (GoG)

The grammar of graphics is a theoretical approximation to the study of the graphics components. According to this, a graphical representation can be built from layers, by defining different elements

  • the identification of the data to be represented,
  • the association of the data variables to the graphical elements, and
  • the definition of the graphical geometry.

Additional elements of the representation can also be adjusted like scales, formats, annotations…, after the basic structure of the graphics is established.

Wilkinson, L. (2006). The grammar of graphics. Springer Science & Business Media.

ggplot2

ggplot2 is an R package which implements the grammar of graphics. A full reference can be found at http://ggplot2.tidyverse.org.

A current cheatsheet is available at https://www.rstudio.com/resources/cheatsheets/.

Wickham, H. (2010). A layered grammar of graphics. Journal of Computational and Graphical Statistics, 19(1), 3-28.

ggplot2 is part of tidyverse package (although it can be installed and loaded independently).

if(!require("tidyverse")) {
  install.packages("tidyverse", repos="https://cloud.r-project.org/")
  library("tidyverse")
}

ggplot2 – Layers and Graphical Elements

In ggplot2, each graphical depiction of a data set is a layer. One or more layers make a graph.

Each layer is defined by the specification of its elements. The main ones are the data, the aesthetic mapping –the relation between the data and the graphical elements–, and the geometry.

In ggplot2, a plot is an R object and it can be stored. It is constructed additively, by summing different elements to the gg object constructed by the ggplot2 function.

For example,

plot1 <- ggplot(data = anscombe,
        mapping = aes(x = x1, y = y1))  # Data and aesthetic mapping 
plot1 <- plot1 + geom_point()       # Geometry
plot1

ggplot2 – Data

In ggplot2, the data is the first argument of the ggplot function. It is usually a data frame –or an object that can be converted to one–.

plot1 <- ggplot(data = anscombe,

ggplot2 – Aesthetic mapping

The aesthetic mapping establishes the relationships between data variables and graphical elements. It is the second argument of the ggplot function and it must be created from the aes support function.

        mapping = aes(x = x1, y = y1))

The most adequate mappings will depend on the type of variable and the type of plot –geometry–.

For quantitative variables, the most usual mappings are positions –x, y…–, size, and colors –color, fill–.

For qualitative variables, we opt for positions –x, y…–, colors –color, fill–, and shape as the mappings of preference.

ggplot2 – Geometry

The geometries (geom_...) indicate the format –type– of the plot, meaning how the graphical variables are put together. One or more geometries (each will be a layer) can be included in a plot by adding them to the gg object.

plot1 <- plot1 + geom_point()

The data and the aesthetic mapping are inherited from the gg object but can be overrode if defined as arguments in the geom_... function. CAUTION: the order of the arguments in the geom_... is reversed from the order in the ggplot function; aesthetic mapping precedes the data.

Common geometries are

  • geom_point
  • geom_line, geom_vline, geom_hline
  • geom_bar
  • geom_histogram
  • geom_boxplot

For each geometry, it is important to check what are the required and available mappings. This is available in the help pages and in the cheatsheet available at https://www.rstudio.com/resources/cheatsheets/.

ggplot2 Graphs

We will discuss here how to create the most common graphs in ggplot2. We will be adding specific considerations where required. We will explore

  • scatter plots,
  • histograms,
  • bar plots, and
  • boxplots.

By default, the plots in ggplot2 are constructed from a data frame from which the variables to be represented are selected.

We will use a sample of 1000 points from the diamonds data set –included in the ggplot2 package– to create the examples.

diaM <- diamonds[sample(1:nrow(diamonds),1000),]
str(diaM)
## tibble [1,000 x 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:1000] 0.78 0.51 0.91 1.24 1.04 0.23 0.71 0.4 1.54 0.93 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 3 4 5 3 1 3 5 5 4 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 4 6 2 4 3 3 5 2 1 7 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 4 3 3 2 1 7 4 4 4 4 ...
##  $ depth  : num [1:1000] 61.3 62.7 62.3 63.3 65.4 60.1 62.1 62.3 59.3 63.3 ...
##  $ table  : num [1:1000] 60 59 55 57 60 57 59 56 59 61 ...
##  $ price  : int [1:1000] 3012 984 5431 4462 2134 465 2572 912 16736 3208 ...
##  $ x      : num [1:1000] 5.89 5.06 6.2 6.85 6.39 4 5.73 4.72 7.54 6.19 ...
##  $ y      : num [1:1000] 5.96 5.09 6.23 6.79 6.3 4.02 5.68 4.75 7.61 6.14 ...
##  $ z      : num [1:1000] 3.63 3.18 3.87 4.32 4.15 2.41 3.55 2.95 4.49 3.9 ...

ggplot2 Graphs – Scatter Plot

ggplot(diaM, aes(x=carat,y=price)) + geom_point()

Adding a third variable and adjusting formats…

ggplot(diaM, aes(x=carat,y=price,color=cut)) + 
  geom_point(alpha=.8,shape=21,size=3)

With a trend line…

ggplot(diaM, aes(x=carat,y=price,color=cut)) + 
  geom_point(alpha=.8,shape=21,size=3) +
  geom_smooth(method="lm",se=FALSE)
## `geom_smooth()` using formula 'y ~ x'

ggplot2 Graphs – Histogram

ggplot(diaM, aes(x=price)) + geom_histogram(binwidth=1000)

Adding the cut variable…

ggplot(diaM, aes(x=price,fill=cut)) +
  geom_histogram(position='dodge',binwidth=1000)

In relative frequencies, per group…

ggplot(diaM, aes(x=price,y=..density..,fill=cut)) +
  geom_histogram(position='dodge',binwidth=1000)

Let’s try with a density plot…

ggplot(diaM, aes(x=price,fill=cut)) +
  geom_density(alpha=.3)

ggplot2 Graphs – Bar Plot

ggplot(diaM, aes(x=clarity)) +
  geom_bar()

Adding the cut variable…

ggplot(diaM, aes(x=clarity, fill=cut)) +
  geom_bar()

To compare absolute frequencies, dodged bars work better…

ggplot(diaM, aes(x=clarity, fill=cut)) +
  geom_bar(position="dodge")

To compare cumulative relative frequencies, per-class stacked bars are more useful.

ggplot(diaM, aes(x=clarity, fill=cut)) +
  geom_bar(position="fill")

Bar plots can also be constructed from frequency counts. In this case, the y variable has to be stated and the argument stat="identity" has to be added to the geom_bar.

ggplot2 Graphs – Boxplot

ggplot(diaM, aes(y=price)) +
  geom_boxplot()

Adding the cut variable…

ggplot(diaM, aes(x=cut, y=price)) +
  geom_boxplot()

The chart can be improved by showing all the data points with the position modified with some random noise –jittered–.

ggplot(diaM, aes(x=cut, y=price)) + 
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(shape = 21, alpha=.5,height=0,width=.2)

Or including a violin plot, geom_violin, and an addtional point with the the mean.

means <- diaM %>% group_by(cut) %>%
  summarise(price=mean(price))

ggplot(diaM, aes(x=cut, y=price)) + 
  geom_violin() +
  geom_boxplot(outlier.shape = NA, width = 0.1) +
  geom_point(data=means,shape=3)

ggplot2 – Additional Elements

Beyond the already presented elements –data, aesthetic mapping, and geometries–, other ggplot2 elements help adjust many other aspects of the plots, for example

  • scale_... controls aspects related to the presentation of the graphical variables,
  • coord_... adjustes the coordinate system used in the plot,
  • labs sets plot and axes titles,
  • theme, theme_... control the non-data ink elements of the plot, and
  • facet_ facilitates the creation of multiple plots.

One last example…

ggplot(diaM, aes(x=carat, y = price,
                 shape = cut, col = clarity)) +
  geom_point(alpha=.6)
## Warning: Using shapes for an ordinal variable is not advised

ggplot(diaM, aes(x=carat, y = price,
                 shape = cut, col = clarity)) +
  geom_point(alpha=.6) +
  scale_x_continuous(breaks=1:3) +
  scale_y_continuous(trans="log10") +
  scale_color_brewer(palette="Spectral")+
  facet_grid(cut ~ clarity) + 
  theme_bw() + 
  theme(legend.position = "none",
        text = element_text(size=10))
## Warning: Using shapes for an ordinal variable is not advised

YOUR TURN

  1. Revisit the OrchardSprays data set and make two plots that highlight the conclusions that can be reached from analyzing it.